Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

driver: ethernet to allow jumbo frames 9000 on rpi 4B #5419

Closed
wants to merge 5 commits into from
Closed

driver: ethernet to allow jumbo frames 9000 on rpi 4B #5419

wants to merge 5 commits into from

Conversation

delgh1
Copy link

@delgh1 delgh1 commented Apr 2, 2023

known issue

Can set MTU to 9000 but cannot receive a packet larger than 2004 bytes
The situation has not changed since last year (Mar 2022):
https://forums.raspberrypi.com/viewtopic.php?t=330976

Only tested on my rpi 4B rev 1.5. I dare you to merge this.

delgh1 and others added 3 commits April 1, 2023 21:09
Signed-off-by: Jing Luo <[email protected]>
fixup! gpio-fsm: Avoid truncation of delay jiffies
@6by9
Copy link
Contributor

6by9 commented Apr 6, 2023

Changing values in a userspace API (uapi) header is definitely not the right solution.
Likewise changing an in-kernel define used by all drivers is highly unlikely to be correct.

Increasing the buffer size in bcmgenet.c may be valid, but I'd then expect it to be a userspace call to increase the MTU on the interface.

@6by9
Copy link
Contributor

6by9 commented Apr 26, 2023

Same comments.
Increasing RX_BUF_LENGTH in bcmgenet.c superficially appears valid (some reservation over 10240 not being a multiple of a 4kB page size), but changing uapi headers is not acceptable.

What I suspect you intend is to increase ENET_MAX_MTU_SIZE defined in bcmgenet.h

#define ENET_MAX_MTU_SIZE	(ETH_DATA_LEN + ETH_HLEN + VLAN_HLEN + \
				 ENET_BRCM_TAG_LEN + ETH_FCS_LEN + ENET_PAD)

The dependency there on ETH_DATA_LEN is most likely what you want to change.

@delgh1 delgh1 marked this pull request as draft April 26, 2023 13:23
@delgh1
Copy link
Author

delgh1 commented Apr 26, 2023

Oops, sorry

@stevenhoving
Copy link

stevenhoving commented May 11, 2023

So I have been pondering about this thingy for some time. I like the author his "push", and I hope that I am correct to assume it was to trigger a discussion.
So what is really needed to add such a feature to the bcm driver? Currently the driver allocates and initializes the "buffer" on driver instantiation/loading/binding. This of course doesnt allow for resizing in the current implementation. If you look at the intel Ethernet driver implementations you see that there is generic infrastructure for this in place in the Ethernet driver "framework". But I think that requires to refactor/re-engineer the bcm driver so much that it must be upstreamed before it can be considered to be integrated here (I hope I am correct so far).
So in order to pull this off somebody with enough time should:

  • factor out the driver buffer allocation and initialization in a way it can be reused.
  • figure out how to change the size of the frames on the fly safely (I personally think this is the most tricky thing)
  • upstream it, which is also not easy.

Still I think I am assuming a lot here, that it is possible that the Ethernet driver buffer size can be changed on the fly. That the generic infra that I mention isnt something intel specific and that the linux net maintainer is open for such a contribution. But if we dont try we will never know.

So if (a big if) we have a upstream patch accepted by the maintainer, would that mean that it could be considered to be merged?

@6by9
Copy link
Contributor

6by9 commented May 11, 2023

@delgh1 There's no point in updating this PR / your branch with merge commits from rpi-6.1.y. It is not going to be merged in the current form under any circumstances. Dare not accepted.

@stevenhoving Can you point at the framework you're referring to that Intel use?

I've already suggested changing

#define ENET_MAX_MTU_SIZE	(ETH_DATA_LEN + ETH_HLEN + VLAN_HLEN + \
				 ENET_BRCM_TAG_LEN + ETH_FCS_LEN + ENET_PAD)

in bcmgenet.h and replacing the ETH_DATA_LEN with 9000 rather than messing with ETH_DATA_LEN may give the results you desire. No one seems to have chosen to test it. (You also need the change to RX_BUF_LENGTH in bcmgenet.c).
It will consume more memory, but I think we're talking 4 * 32 buffers at ~10kB instead or ~2kB, and losing 1MB is hardly a big issue.
TBH if it works, doing that and sending it upstream would at least start the conversation with the maintainers for bcmgenet. It may have issues on other versions of the bcmgenet hardware though, but could move into the bcmgenet_hw_params array so that it does change based on revision.

IIRC @P33M has worked on bcmgenet in the past and may know of any issues with jumbo frames (or be able to point me at docs).

We're almost always happy to backport patches that have been accepted usptream and add useful functionality. The main criteria being that it doesn't require loads of additional patches to add framework support that is missing in the earlier kernel revision.

@P33M
Copy link
Contributor

P33M commented May 11, 2023

IIRC @P33M has worked on bcmgenet in the past and may know of any issues with jumbo frames (or be able to point me at docs).

My sum total of contributions to this driver are trivial edits to two lines. We don't have documentation other than register definitions (in particular, no architecture reference or programmer's model).

@6by9
Copy link
Contributor

6by9 commented May 11, 2023

IIRC @P33M has worked on bcmgenet in the past and may know of any issues with jumbo frames (or be able to point me at docs).

My sum total of contributions to this driver are trivial edits to two lines. We don't have documentation other than register definitions (in particular, no architecture reference or programmer's model).

Ah, OK. Thanks.

If the simplistic change works, then throwing it upstream to start the conversation sounds like the best plan then.

@graysky2
Copy link

graysky2 commented May 16, 2023

If the simplistic change works, then throwing it upstream to start the conversation sounds like the best plan then.

Has anyone done this? Link?

@delgh1 - have you generated any benchmark data to support that a larger MTU size adds any benefits? I tested this under the 5.4 series of kernels and found that jumboframes on simple file transfers were much slower compared to the default mtu value.

Have you observed any stability regressions? This thread you linked has one report of NIC timeouts that are not automatically recovered/reboot required. Here is another recent thread showing regressions.

@wtschueller
Copy link

wtschueller commented Jul 4, 2023

Larger MTU does have benefit for protocols relying on udp and disallowing fragmentation.
IMHO it is difficult to saturate a 1GBit/s link with udp packets of only 1500 bytes.

I did the simplistic changes for a 5.10.y kernel. It did not work for me:

  1. checksum offloading seems to corrupt outgoing packets
  2. incoming packets will be truncated if receive status blocks are used
  3. transmission time out may occur if packets of specific lengths are transmitted
  4. I found an additional phy setting. I could not observe any effect.

As far as I can see I am the only source of these regression reports.
I did the regression test on a single piece of hardware.
I fixed 1. and 2. by rolling back a single commit and disabling hardware crc.
I did not found a good solution for 3. I decided to just pad those packets.
However, the fixes do work on several samples of RPi4 and CM4 since then.
The fixes conflict even for 5.15.y.

So I would really appreciate if someone could repeat it for a recent kernel.

@6by9
Copy link
Contributor

6by9 commented Jul 4, 2023

I did the simplistic changes for a 5.10.y kernel. It did not work for me:

1. checksum offloading seems to corrupt outgoing packets

2. incoming packets will be truncated if receive status blocks are used

3. transmission time out may occur if packets of specific lengths are transmitted

4. I found an additional phy setting. I could not observe any effect.

As far as I can see I am the only source of these regression reports. I did the regression test on a single piece of hardware. I fixed 1. and 2. by rolling back a single commit and disabling hardware crc. I did not found a good solution for 3. I decided to just pad those packets. However, the fixes do work on several samples of RPi4 and CM4 since then. The fixes conflict even for 5.15.y.

You've not given any details of what commit you reverted, fixes applied, or where you're getting a conflict.

So I would really appreciate if someone could repeat it for a recent kernel.

How can we repeat something we have no details over?

There is a proposal here of potentially how jumbo frame support might be implemented, or at least a suggestion how the development might continue. Testing other unrelated suggestions should be on the forums or an issue thread, not on a PR.

@wtschueller
Copy link

Well, here are more details:

  1. start with the original patch (uapi changes) or increase the buffer only and advertise a larger max_mtu (as already suggested here)
  2. try sending jumbo udp packets, check whether they get corrupted
  3. check the effect of checksum offloading on/off (with ethtools)
  4. try receiving jumbo udp packets, this may require extra logging of the received dma blocks inside the driver
  5. roll back net: bcmgenet: always enable status blocks (sorry for the url typo in older posts)
  6. repeat 2./3./4.
  7. stress test, start reading here: eth0 (bcmgenet): transmit queue timed out #4485 (comment)

I think that all working solution with jumbo frames base on 5.4 kernels or similar with crc offloading disabled by default.
See also here https://support.thinklucid.com/knowledgebase/jumbo-frames-on-raspberry-pi/.

@delgh1 delgh1 closed this Jul 5, 2023
@delgh1 delgh1 deleted the dev-rpi-6.1 branch July 5, 2023 13:20
@6by9
Copy link
Contributor

6by9 commented Jul 5, 2023

@delgh1 Does you closing this PR mean you no longer have any interest in the issue?

@delgh1
Copy link
Author

delgh1 commented Jul 5, 2023

@delgh1 Does you closing this PR mean you no longer have any interest in the issue?

@6by9 Unfortunately yes. Due to various personal health issues, I will be focus on other things for the time I have left to live.
Thank you for the contributions btw.

@6by9
Copy link
Contributor

6by9 commented Jul 7, 2023

I've had a quick hack around.

The core validates whether the requested MTU is within the range defined by the card (https://elixir.bootlin.com/linux/latest/source/net/core/dev.c#L8672). The default is set to ETH_DATA_LEN at https://elixir.bootlin.com/linux/latest/source/net/ethernet/eth.c#L365, therefore the hack was bumping that up too.
That means adding a fragment

@@ -4056,6 +4056,10 @@ static int bcmgenet_probe(struct platform_device *pdev)
        priv->tx_pause = 1;
        priv->rx_pause = 1;
 
+       dev->mtu = ETH_DATA_LEN;
+       dev->min_mtu = ETH_MIN_MTU;
+       dev->max_mtu = ENET_MAX_MTU_SIZE - VLAN_ETH_HLEN;
+
        SET_NETDEV_DEV(dev, &pdev->dev);
        dev_set_drvdata(&pdev->dev, dev);
        dev->watchdog_timeo = 2 * HZ;

to bcmgenet.c in addition to the mods for ENET_MAX_MTU_SIZE and RX_BUF_LENGTH previously discussed.

That still doesn't work on 6.1 though, so it'd be interesting to try them on 5.15 if that is supposed to previously have worked with the hack.

@wtschueller
Copy link

What exactly does not work? Does a MTU 1900 work? I would expect it to work up to nearly 2k.

@6by9
Copy link
Contributor

6by9 commented Jul 10, 2023

I'm using a Netgear GS110TP with max frame size on all ports set to 9000.
Take these changes and reboot. I then can't SSH in unless you reset the MTU to 1500 via the console of the Pi. Admittedly I am SSHing from an Ubuntu VM running on a Windows host with Asix AX88179 ethernet interface, but it claims that it supports jumbo frames, and accepts setting an mtu of 9000.

Set the Pi MTU to 2000.
ping 192.168.2.253 -c 20 -M do -s 1472 (total of 1500 byte frame) works.
ping 192.168.2.253 -c 20 -M do -s 1473 (total of 1501 byte frame) fails.

I've got another Asix AX88179 based adapter around, so I'll confirm whether that behaves itself in this setup.

@6by9
Copy link
Contributor

6by9 commented Jul 10, 2023

Port stats on the Netgear switch give a combined TX & RX for packets of >1522.

For an AX88179 on a Pi it only increments by 1 per message when pinging the switch with a packet of 1528 bytes (1500 byte payload).
With the proposed patches, I get the same behaviour. That would indicate to me that whilst it will switch jumbo packets, the switch's IP stack doesn't support them.

Setting up a second Pi with the patches does respond to pings up to -s 2004, so it is a test infrastructure issue. Confirmed with the AX88179 pinging a CM4 as well.

That was on 5.15.92. Now retesting with the clean patches on 6.4.1.

@6by9
Copy link
Contributor

6by9 commented Jul 10, 2023

So 6.4.1 with the diff

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index 8a69d8f24c53..d1bdd4da4c75 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -52,7 +52,7 @@
 #define GENET_Q16_TX_BD_CNT    \
        (TOTAL_DESC - priv->hw_params->tx_queues * priv->hw_params->tx_bds_per_q)
 
-#define RX_BUF_LENGTH          2048
+#define RX_BUF_LENGTH          10240
 #define SKB_ALIGNMENT          32
 
 /* Tx/Rx DMA register offset, skip 256 descriptors */
@@ -4056,6 +4056,10 @@ static int bcmgenet_probe(struct platform_device *pdev)
        priv->tx_pause = 1;
        priv->rx_pause = 1;
 
+       dev->mtu = ETH_DATA_LEN;
+       dev->min_mtu = ETH_MIN_MTU;
+       dev->max_mtu = ENET_MAX_MTU_SIZE - VLAN_ETH_HLEN;
+
        SET_NETDEV_DEV(dev, &pdev->dev);
        dev_set_drvdata(&pdev->dev, dev);
        dev->watchdog_timeo = 2 * HZ;
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.h b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
index 4a73c1572398..ac30b7866463 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.h
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
@@ -29,7 +29,7 @@
  */
 #define ENET_BRCM_TAG_LEN      6
 #define ENET_PAD               8
-#define ENET_MAX_MTU_SIZE      (ETH_DATA_LEN + ETH_HLEN + VLAN_HLEN + \
+#define ENET_MAX_MTU_SIZE      (9000 + ETH_HLEN + VLAN_HLEN + \
                                 ENET_BRCM_TAG_LEN + ETH_FCS_LEN + ENET_PAD)
 #define DMA_MAX_BURST_LENGTH    0x08

works up to -s 2004 between a Pi4 and CM4, without the hacking around in UAPI or other kernel headers.
The default MTU is the standard 1500, so needs to be increased to allow for bigger frames with sudo ip link set dev eth0 mtu 9000 or similar.

For >2004 byte packets, the switch port stats again show that the counters increase on tx, but no rx.
Ping from CM4 to AX88179 on Pi4, and tcpdump on Pi4 shows the 3000byte pings coming in, and replies being sent. That would largely confirm that it's a receive issue.
Turning all the offload features off doesn't seem to help. I do note that ip -s link show dev eth0 increments "errors" and "missed" for each "missed" ping response. It'll take a little digging into the conditions that results in those being incremented, and confirming if they are generic with all offload settings.

@wtschueller
Copy link

Good Job! This coincidences with my findings when I ported the patch from 5.4 to 5.10.

Does ping -s completely check the packet content? With hardware CRC enabled, It seemed to me that the hardware crc engine puts the crc at the wrong place into large enough outging packets. This could be also some misconfiguration. IIRC, I saw this using udp packets.

I added a lot of logging code and I found that larger packets get split into multiple dma blocks upon reception. I could avoid this only by disabling receive status blocks. Again, maybe this is a misconfiguration.

Finally I decided to roll back the commit introducing unconditional status blocks as a quick fix over concatenating the dma blocks again. I could not solve the hardware crc issue upon sending anyway. I could not guess the correct settings just from the source code.

@wtschueller
Copy link

side note: https://www.asix.com.tw/en/product/USBEthernet/Super-Speed_USB_Ethernet/AX88179 says

  • Supports Jumbo frame up to 4KB

@6by9
Copy link
Contributor

6by9 commented Jul 11, 2023

side note: https://www.asix.com.tw/en/product/USBEthernet/Super-Speed_USB_Ethernet/AX88179 says

* Supports Jumbo frame up to 4KB

I know, although I was reading https://github.com/torvalds/linux/blob/master/drivers/net/usb/ax88179_178a.c#L1283

dev->net->max_mtu = 4088;

I'm generally using a ping payload size of 3000.

@6by9
Copy link
Contributor

6by9 commented Jul 11, 2023

So the driver is dropping what it sees as a fragmented packet. The error path at https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/broadcom/genet/bcmgenet.c#L2325

For my 3000 byte payload ping, dma_length_status is 0x08403f80. length 0x840 is 2112. 0x3f80 is DMA_SOP set (0x2000), but not DMA_EOP (0x4000).
2112 is 2048 + 64 header.

There is a register "GENET_0_RBUF_PKT_RDY_THLD - RX Buffer Packet Ready Threshold Register" at offset 0x08 which is the threshold for PKT_RDY being asserted, in units of 16bytes. Default is 0x80, which * 16bytes = 2048.
However it's only an 8bit value, so max 255 * 16 = 4080bytes, however recommended to be divisible by the burst size (256B).
Increase it to 0xf0 for 3840 bytes, and I can receive up to 3796 byte ping payloads. dma_length_status is 0x0f407f80.

Patch fragments

@@ -2568,6 +2570,9 @@ static void init_umac(struct bcmgenet_priv *priv)
        reg |= RBUF_ALIGN_2B | RBUF_64B_EN;
        bcmgenet_rbuf_writel(priv, reg, RBUF_CTRL);
 
+       reg = 0xf0;
+       bcmgenet_rbuf_writel(priv, reg, RBUF_PKT_RDY_THLD);
+
        /* enable rx checksumming */
        reg = bcmgenet_rbuf_readl(priv, RBUF_CHK_CTRL);
        reg |= RBUF_RXCHK_EN | RBUF_L3_PARSE_DIS;
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.h b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
index d48d64be7010..db4abd9d95c1 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.h
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
@@ -192,6 +192,8 @@ struct bcmgenet_mib_counters {
 #define  RBUF_ALIGN_2B                 (1 << 1)
 #define  RBUF_BAD_DIS                  (1 << 2)
 
+#define RBUF_PKT_RDY_THLD              0x08
+
 #define RBUF_STATUS                    0x0C
 #define  RBUF_STATUS_WOL               (1 << 0)
 #define  RBUF_STATUS_MPD_INTR_ACTIVE   (1 << 1)

From the limited docs I have, that threshold is common to whether the 64bit status block is present or not, although the 64byte header is meant to be taken into account if present (Should be max 0x7c when using 2kB buffers for 1984byte payload and 64byte header = 2048bytes total. The current driver doesn't do that, but it also has a max mtu of 1518). That makes me wonder over your report that it was working on 5.4 if the status buffers were disabled.

I'll create a PR with my cleaned up diffs that give 3840byte support, and then anyone can try them.

I'm running out of ideas with regard how to reassemble the fragmented packets for more than 3840 bytes. I'm sure the networking stack has a mechanism to deal with it, but don't know enough about how it functions. There are features of NETIF_F_FRAGLIST and NETIF_F_SG (scatter-gather), but brief reading implies they impact transmit instead of receive. If we can chain the skbs for the received fragments, then that might work. An email to net-dev is probably warranted.

@6by9
Copy link
Contributor

6by9 commented Jul 11, 2023

PR #5534

If you want to try it out, then once CI has built it, sudo rpi-update pulls/5534 should get you a kernel with those patches. Normal warnings apply over using rpi-update only on non-critical devices.

@wtschueller
Copy link

wtschueller commented Jul 12, 2023

Now I'm amazed that the orginal patch did work at all, too. Thankfully there are other reports besides mine.
I will definitely try your patches. Unfortunately, it can be only in August.

Are there other settings related to outgoing large packets that could explain my observation of packet corruption?

Did you also test other protocols like xrdp or rcp? I think it is unlikely that ssh uses a lot of jumbo frames
and I also remember that tcp is pretty robust. If larger packets are damaged or lost, the window size is reduced
and maybe it masks transmission problems. At least for sufficiently small data rate.

I also considered reassembling incoming packets.
This https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/e1000/e1000_main.c#L4196
made me think that it cannot be too hard. There was a far better example that I cannot find it again.
Maybe one of the FPGA based MACs that has very limited DMA block size and needs reassembling even for MTU 1500.

@6by9
Copy link
Contributor

6by9 commented Jul 12, 2023

I was largely testing with ping, although I did check with iperf3 (in UDP mode), and I was assuming that one of them was checking the data was correct. I wasn't observing buffers being discarded due to corruption.
At a guess you were seeing issues due to the default PKT_RDY threshold being 0x80 (2048 bytes), when it needed to consider the 64byte status block header.

Seeing as we can tell the hardware that it has 10kB buffers, it feels like we shouldn't need to handle fragmented receives. However it depends on how the hardware handles things when it hits that PKT_RDY threshold - does it switch to a new buffer, or just notifies that data is ready and keep going with the same one? If the latter, then why does it support 16kB buffers in the rings? Those are questions for those who support the hardware as I don't have that information.

e1000 is having to mess with passing page pointers into skb_fill_page_desc. Currently genet is largely ignoring pages as (AIUI) it uses the raw ring buffer to pass the data. Having to swap is likely to be fairly significant.

Disabling the status block shouldn't be too difficult as a hack, but I need to move onto other things right now. It also loses all the hardware checksum stuff, so may have a negative effect on performance overall.

@wtschueller
Copy link

Thank you so far, 6by9. I think it is a huge leap forward anyway.

As for the status blocks, I agree. Especially as long as it is unclear why status blocks make any difference.

Could you still have a look at your documentation if there is also a tx threshold?
From other hardware I would expect settings for store-and-forward or send-through modes.

@6by9
Copy link
Contributor

6by9 commented Jul 12, 2023

There is also a TBUF_TXCHK_PKTRDY_THLD register that has almost the same description as RBUF_PKT_RDY_THLD.

This is a threshold for which PKT_RDY is asserted

Units of 16bytes, default of 0x80.
I have no information on what exactly PKT_RDY does for us under rx or tx, but I would have read it as triggering the interrupt

I wonder if reading the status register (instead of looking at the status block) is just a timing thing.
1000Mbit/s is 125Mbytes/s (ignoring any overheads). A 9000 byte packet therefore takes 72usecs to be received.
If the hardware is completing the status block then the value is fixed when it starts writing it. If software has to read it from the register, then I wouldn't be surprised if it took more than 72us to get around to it, therefore the register reflects the complete frame having been received.

Time for a quick hack to read the register if we're in the fragmented packet case, and see if that magically makes things work.....

@wtschueller
Copy link

Another wild guess: The threshold values and PKT_RDY trigger the DMA engine for incoming packets and the transmit logic for outgoing ones. Assuming that there are extra FIFOs to receive from or send to the wire. However, this does not explain why incoming packets get split into multiple dma buffers at exactly that threshold. What does the documentation say about threshold == 0? Or isn't it allowed?

I think I don't get what you mean with the timing thing. I already benchmarked outgoing packets rate and saw more than 50k packets/s and also more than 50k interrupts/s. So less than 20µsec per packet. IIRC larger incoming packets consume multiple dma blocks and the hardware also indicates that it has consumed more than one block.

So the driver is dropping what it sees as a fragmented packet. The error path at https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/broadcom/genet/bcmgenet.c#L2325

For my 3000 byte payload ping, dma_length_status is 0x08403f80. length 0x840 is 2112. 0x3f80 is DMA_SOP set (0x2000), but not DMA_EOP (0x4000). 2112 is 2048 + 64 header.

My guess is that the EOP flag is set in the next dma buffer holding the remaining bytes. Currently, the driver requires both flags in a single dma block.

@6by9
Copy link
Contributor

6by9 commented Jul 12, 2023

Forgot to hit the "comment" button before I left the office.

Reading the register when DMA_EOF isn't set does give me a len_stat value with DMA_EOF set. With a ping -s 3900 (to stay below the AX88179's limit)

[  307.274090] bcmgenet: bcmgenet_rx_refill: ffffffc008b22864, 0x000000004c9c0040
[  307.274128] bcmgenet: bcmgenet_desc_rx:p_ind=180 c_ind=179 read_ptr=179 len_stat=0x0f403f80
[  307.274148] bcmgenet: bcmgenet_desc_rx:reread p_ind=180 c_ind=179 read_ptr=179 len_stat=0x0fe87f80
[  308.287339] bcmgenet: bcmgenet_rx_refill: ffffffc008b22870, 0x000000004c9c4040
[  308.287384] bcmgenet: bcmgenet_desc_rx:p_ind=181 c_ind=180 read_ptr=180 len_stat=0x0f403f80
[  308.287408] bcmgenet: bcmgenet_desc_rx:reread p_ind=181 c_ind=180 read_ptr=180 len_stat=0x0fe87f80
[  309.311292] bcmgenet: bcmgenet_rx_refill: ffffffc008b2287c, 0x000000004c9c8040
[  309.311332] bcmgenet: bcmgenet_desc_rx:p_ind=182 c_ind=181 read_ptr=181 len_stat=0x0f403f80
[  309.311354] bcmgenet: bcmgenet_desc_rx:reread p_ind=182 c_ind=181 read_ptr=181 len_stat=0x0fe87f80
[  310.335301] bcmgenet: bcmgenet_rx_refill: ffffffc008b22888, 0x000000004c9cc040
[  310.335340] bcmgenet: bcmgenet_desc_rx:p_ind=183 c_ind=182 read_ptr=182 len_stat=0x0f403f80
[  310.335365] bcmgenet: bcmgenet_desc_rx:reread p_ind=183 c_ind=182 read_ptr=182 len_stat=0x0fe87f80

Status block length of 0xf40 would match rx threshold of 0xf0 + 64 (0x40) bytes of status block.
Register value of 0xfe8 (4072) would be a payload of 4008 bytes + 64 bytes of status block. That seems a large payload for a 3900 ping payload + header.
Maybe not as a ping payload of 3796 bytes gives a received length of 0xf40 (3904), or 108 byte overhead.

So we're getting a packet in, and the driver is kicking it upwards, but it appears that something in the stack is dropping it before it gets to my client app (ping). Presumably it's a checksum failing somewhere as the offload isn't being performed (I believe I'm telling the stack that the hardware hasn't done it)

Diff

diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index 400d5528cfdc..49ea852ba379 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -99,6 +99,12 @@ static inline void dmadesc_set_length_status(struct bcmgenet_priv *priv,
        bcmgenet_writel(value, d + DMA_DESC_LENGTH_STATUS);
 }
 
+static inline u32 dmadesc_get_length_status(struct bcmgenet_priv *priv,
+                                           void __iomem *d)
+{
+       return bcmgenet_readl(d + DMA_DESC_LENGTH_STATUS);
+}
+
 static inline void dmadesc_set_addr(struct bcmgenet_priv *priv,
                                    void __iomem *d,
                                    dma_addr_t addr)
@@ -2302,13 +2308,6 @@ static unsigned int bcmgenet_desc_rx(struct bcmgenet_rx_ring *ring,
 
                status = (struct status_64 *)skb->data;
                dma_length_status = status->length_status;
-               if (dev->features & NETIF_F_RXCSUM) {
-                       rx_csum = (__force __be16)(status->rx_csum & 0xffff);
-                       if (rx_csum) {
-                               skb->csum = (__force __wsum)ntohs(rx_csum);
-                               skb->ip_summed = CHECKSUM_COMPLETE;
-                       }
-               }
 
                /* DMA flags and length are still valid no matter how
                 * we got the Receive Status Vector (64B RSB or register)
@@ -2320,6 +2319,34 @@ static unsigned int bcmgenet_desc_rx(struct bcmgenet_rx_ring *ring,
                          __func__, p_index, ring->c_index,
                          ring->read_ptr, dma_length_status);
 
+               if (!(dma_flag & DMA_EOP) || !(dma_flag & DMA_SOP)) {
+                       if (!(dma_flag & DMA_EOP)) {
+                               dma_length_status =
+                                       dmadesc_get_length_status(priv, cb->bd_addr);
+                               dma_flag = dma_length_status & 0xffff;
+                               len = dma_length_status >> DMA_BUFLENGTH_SHIFT;
+                               status->rx_csum = 0;
+                               pr_err("%s:reread p_ind=%d c_ind=%d read_ptr=%d len_stat=0x%08x\n",
+                                         __func__, p_index, ring->c_index,
+                                         ring->read_ptr, dma_length_status);
+                       }
+
+                       if (unlikely(!(dma_flag & DMA_EOP) || !(dma_flag & DMA_SOP))) {
+                               pr_err("dropping fragmented packet!\n");
+                               ring->errors++;
+                               dev_kfree_skb_any(skb);
+                               goto next;
+                       }
+               }
+
+               if (dev->features & NETIF_F_RXCSUM) {
+                       rx_csum = (__force __be16)(status->rx_csum & 0xffff);
+                       if (rx_csum) {
+                               skb->csum = (__force __wsum)ntohs(rx_csum);
+                               skb->ip_summed = CHECKSUM_COMPLETE;
+                       }
+               }
+
                if (unlikely(len > RX_BUF_LENGTH)) {
                        pr_err("oversized packet\n");
                        dev->stats.rx_length_errors++;
@@ -2328,13 +2355,6 @@ static unsigned int bcmgenet_desc_rx(struct bcmgenet_rx_ring *ring,
                        goto next;
                }
 
-               if (unlikely(!(dma_flag & DMA_EOP) || !(dma_flag & DMA_SOP))) {
-                       netif_err("dropping fragmented packet!\n");
-                       ring->errors++;
-                       dev_kfree_skb_any(skb);
-                       goto next;
-               }
-
                /* report errors */
                if (unlikely(dma_flag & (DMA_RX_CRC_ERROR |
                                                DMA_RX_OV |

@6by9
Copy link
Contributor

6by9 commented Jul 12, 2023

I think I don't get what you mean with the timing thing. I already benchmarked outgoing packets rate and saw more than 50k packets/s and also more than 50k interrupts/s. So less than 20µsec per packet. IIRC larger incoming packets consume multiple dma blocks and the hardware also indicates that it has consumed more than one block.

50k 9000byte packets/s would be 3.6Gbit/s based on 8 transmitted bits per useful byte (ie no sync bits) - that's not going to happen on a 1Gbit/s link.
I'm thinking of interrupt latency. With the status block the flags field is being set by the hardware at the trigger point. If reading the register, then the CPU has to have responded to the interrupt, notify the network stack, wait for it to poll the interface, and then read the status value from the register. It wouldn't be the full frame time as it's triggered after (now) 3840 of potentially 9000 have been received.
As noted above, doing that appears to give the updated flags at least when only slightly above the threshold value. I'd need a usable interface that supports full 9000 byte MTU to really experiment.

My guess is that the EOP flag is set in the next dma buffer holding the remaining bytes. Currently, the driver requires both flags in a single dma block.

The hardware writes directly into RAM (in 256 byte bursts), and is given 10kB per buffer in the ring. It has no need to move on to the next buffer when it hit the threshold trigger, but I don't see how it can write a valid status block to the start of the overall buffer if it writes linearly. Perhaps it does have an internal buffer for the entire frame in the network block.
I'd need to look again at the logging when receiving multiple fragments.

@wtschueller
Copy link

I did the benchmark with a commercial application sending udp packets and increasing packet size. For packets of 1500 bytes, the application cannot keep up with wire speed, the tx ring gets empty and every packet gets an interrupt. As soon as the application can keep up with wire speed (by larger packets), the interrupt rate falls greatly because the hardware issues an interrupt only every 10th packet (IIRC).

Your assumption that the status block is written before the end of the packet has been seen on the wire is convincing to me. If there is some limited FIFO in the system, then smaller packets can be received entirely before transmitted to RAM, jumbo frames not.

Anyway, from your logging of dma blocks it seems that the incoming packet is not scattered. Only the status block lacks the EOP flag. This does not match my memories but the assumption above. And it is even easier for the driver.

@6by9
Copy link
Contributor

6by9 commented Aug 2, 2023

https://www.spinics.net/lists/netdev/msg919558.html
https://www.spinics.net/lists/netdev/msg925029.html
So Broadcom are going to have a little dig into it.

I'll create a new issue to allow tracking of any progress - closed PRs will tend to be ignored.

@6by9
Copy link
Contributor

6by9 commented Aug 2, 2023

https://www.spinics.net/lists/netdev/msg919558.html
https://www.spinics.net/lists/netdev/msg925029.html
So Broadcom are going to have a little dig into it.

I've created a new issue to allow tracking of any progress - #5561. Closed PRs will tend to be ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants